## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
## [1] 113937 81
## ListingKey ListingNumber
## 17A93590655669644DB4C06: 6 Min. : 4
## 349D3587495831350F0F648: 4 1st Qu.: 400919
## 47C1359638497431975670B: 4 Median : 600554
## 8474358854651984137201C: 4 Mean : 627886
## DE8535960513435199406CE: 4 3rd Qu.: 892634
## 04C13599434217079754AEE: 3 Max. :1255725
## (Other) :113912
## ListingCreationDate CreditGrade Term
## 2013-10-02 17:20:16.550000000: 6 :84984 Min. :12.00
## 2013-08-28 20:31:41.107000000: 4 C : 5649 1st Qu.:36.00
## 2013-09-08 09:27:44.853000000: 4 D : 5153 Median :36.00
## 2013-12-06 05:43:13.830000000: 4 B : 4389 Mean :40.83
## 2013-12-06 11:44:58.283000000: 4 AA : 3509 3rd Qu.:36.00
## 2013-08-21 07:25:22.360000000: 3 HR : 3508 Max. :60.00
## (Other) :113912 (Other): 6745
## LoanStatus ClosedDate
## Current :56576 :58848
## Completed :38074 2014-03-04 00:00:00: 105
## Chargedoff :11992 2014-02-19 00:00:00: 100
## Defaulted : 5018 2014-02-11 00:00:00: 92
## Past Due (1-15 days) : 806 2012-10-30 00:00:00: 81
## Past Due (31-60 days): 363 2013-02-26 00:00:00: 78
## (Other) : 1108 (Other) :54633
## BorrowerAPR BorrowerRate LenderYield
## Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:0.15629 1st Qu.:0.1340 1st Qu.: 0.1242
## Median :0.20976 Median :0.1840 Median : 0.1730
## Mean :0.21883 Mean :0.1928 Mean : 0.1827
## 3rd Qu.:0.28381 3rd Qu.:0.2500 3rd Qu.: 0.2400
## Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.183 Min. :0.005 Min. :-0.183
## 1st Qu.: 0.116 1st Qu.:0.042 1st Qu.: 0.074
## Median : 0.162 Median :0.072 Median : 0.092
## Mean : 0.169 Mean :0.080 Mean : 0.096
## 3rd Qu.: 0.224 3rd Qu.:0.112 3rd Qu.: 0.117
## Max. : 0.320 Max. :0.366 Max. : 0.284
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## Min. :1.000 :29084 Min. : 1.00
## 1st Qu.:3.000 C :18345 1st Qu.: 4.00
## Median :4.000 B :15581 Median : 6.00
## Mean :4.072 A :14551 Mean : 5.95
## 3rd Qu.:5.000 D :14274 3rd Qu.: 8.00
## Max. :7.000 E : 9795 Max. :11.00
## NA's :29084 (Other):12307 NA's :29084
## ListingCategory..numeric. BorrowerState
## Min. : 0.000 CA :14717
## 1st Qu.: 1.000 TX : 6842
## Median : 1.000 NY : 6729
## Mean : 2.774 FL : 6720
## 3rd Qu.: 3.000 IL : 5921
## Max. :20.000 : 5515
## (Other):67493
## Occupation EmploymentStatus
## Other :28617 Employed :67322
## Professional :13628 Full-time :26355
## Computer Programmer : 4478 Self-employed: 6134
## Executive : 4311 Not available: 5347
## Teacher : 3759 Other : 3806
## Administrative Assistant: 3688 : 2255
## (Other) :55456 (Other) : 2718
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 False:56459 False:101218
## 1st Qu.: 26.00 True :57478 True : 12719
## Median : 67.00
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
## NA's :7625
## GroupKey DateCreditPulled
## :100596 2013-12-23 09:38:12: 6
## 783C3371218786870A73D20: 1140 2013-11-21 09:09:41: 4
## 3D4D3366260257624AB272D: 916 2013-12-06 05:43:16: 4
## 6A3B336601725506917317E: 698 2014-01-14 20:17:49: 4
## FEF83377364176536637E50: 611 2014-02-09 12:14:41: 4
## C9643379247860156A00EC0: 342 2013-09-27 22:04:54: 3
## (Other) : 9634 (Other) :113912
## CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.0 Min. : 19.0
## 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## : 697 Min. : 0.00 Min. : 0.00
## 1993-12-01 00:00:00: 185 1st Qu.: 7.00 1st Qu.: 6.00
## 1994-11-01 00:00:00: 178 Median :10.00 Median : 9.00
## 1995-11-01 00:00:00: 168 Mean :10.32 Mean : 9.26
## 1990-04-01 00:00:00: 161 3rd Qu.:13.00 3rd Qu.:12.00
## 1995-03-01 00:00:00: 159 Max. :59.00 Max. :54.00
## (Other) :112389 NA's :7604 NA's :7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 2.00 Min. : 0.00
## 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 25.00 Median : 6.00
## Mean : 26.75 Mean : 6.97
## 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :136.00 Max. :51.00
## NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 15.00
## Median :0.600 Median : 4100 Median : 22.00
## Mean :0.561 Mean : 11210 Mean : 23.23
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 30.00
## Max. :5.950 Max. :646285 Max. :126.00
## NA's :7604 NA's :7544 NA's :7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.820 1st Qu.: 0.000
## Median :0.940 Median : 0.000
## Mean :0.886 Mean : 0.802
## 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :1.000 Max. :20.000
## NA's :7544 NA's :7544
## DebtToIncomeRatio IncomeRange IncomeVerifiable
## Min. : 0.000 $25,000-49,999:32192 False: 8669
## 1st Qu.: 0.140 $50,000-74,999:31050 True :105268
## Median : 0.220 $100,000+ :17337
## Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 0.320 Not displayed : 7741
## Max. :10.010 $1-24,999 : 7274
## NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanKey TotalProsperLoans
## Min. : 0 CB1B37030986463208432A1: 6 Min. :0.00
## 1st Qu.: 3200 2DEE3698211017519D7333F: 4 1st Qu.:1.00
## Median : 4667 9F4B37043517554537C364C: 4 Median :1.00
## Mean : 5608 D895370150591392337ED6D: 4 Mean :1.42
## 3rd Qu.: 6825 E6FB37073953690388BC56D: 4 3rd Qu.:2.00
## Max. :1750003 0D8F37036734373301ED419: 3 Max. :8.00
## (Other) :113912 NA's :91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00
## Median : 16.00 Median : 15.00
## Mean : 22.93 Mean : 22.27
## 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 2014-01-22 00:00:00: 491 Q4 2013:14450
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 Q1 2014:12172
## Median : 6500 2014-02-19 00:00:00: 439 Q3 2013: 9180
## Mean : 8337 2013-10-16 00:00:00: 434 Q2 2013: 7099
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 Q3 2012: 5632
## Max. :35000 2013-09-24 00:00:00: 316 Q2 2012: 5061
## (Other) :111428 (Other):60343
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 63CA34120866140639431C9: 9 Min. : 0.0 Min. : -2.35
## 16083364744933457E57FB9: 8 1st Qu.: 131.6 1st Qu.: 1005.76
## 3A2F3380477699707C81385: 8 Median : 217.7 Median : 2583.83
## 4D9C3403302047712AD0CDD: 8 Mean : 272.5 Mean : 4183.08
## 739C338135235294782AE75: 8 3rd Qu.: 371.6 3rd Qu.: 5548.40
## 7E1733653050264822FAA3D: 8 Max. :2251.5 Max. :40702.39
## (Other) :113888
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
The data set has 113937 observations of 81 variables. Plotting a histogram of each variable will help in understanding the data set. I will plot only the variables that are not factors.
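Selecting the non-factor columns can be sketched as follows. This is a minimal base R illustration on a made-up data frame (`df_toy` and its values are assumptions, not the real loan data), not necessarily the code used for the report.

```r
# Toy stand-in for the loan data; only the column names mirror the real data set.
df_toy <- data.frame(
  Term         = c(36L, 36L, 60L, 12L),
  BorrowerRate = c(0.158, 0.092, 0.275, 0.0974),
  LoanStatus   = factor(c("Completed", "Current", "Completed", "Current"))
)

# Names of the columns that are not factors.
non_factor_cols <- names(df_toy)[!sapply(df_toy, is.factor)]

# One histogram per non-factor column.
for (col in non_factor_cols) {
  hist(df_toy[[col]], main = col, xlab = col)
}
```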
## ListingKey ListingNumber
## 0 0
## ListingCreationDate CreditGrade
## 0 0
## Term LoanStatus
## 0 0
## ClosedDate BorrowerAPR
## 0 25
## BorrowerRate LenderYield
## 0 0
## EstimatedEffectiveYield EstimatedLoss
## 29084 29084
## EstimatedReturn ProsperRating..numeric.
## 29084 29084
## ProsperRating..Alpha. ProsperScore
## 0 29084
## ListingCategory..numeric. BorrowerState
## 0 0
## Occupation EmploymentStatus
## 0 0
## EmploymentStatusDuration IsBorrowerHomeowner
## 7625 0
## CurrentlyInGroup GroupKey
## 0 0
## DateCreditPulled CreditScoreRangeLower
## 0 591
## CreditScoreRangeUpper FirstRecordedCreditLine
## 591 0
## CurrentCreditLines OpenCreditLines
## 7604 7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## 697 0
## OpenRevolvingMonthlyPayment InquiriesLast6Months
## 0 697
## TotalInquiries CurrentDelinquencies
## 1159 697
## AmountDelinquent DelinquenciesLast7Years
## 7622 990
## PublicRecordsLast10Years PublicRecordsLast12Months
## 697 7604
## RevolvingCreditBalance BankcardUtilization
## 7604 7604
## AvailableBankcardCredit TotalTrades
## 7544 7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## 7544 7544
## DebtToIncomeRatio IncomeRange
## 8554 0
## IncomeVerifiable StatedMonthlyIncome
## 0 0
## LoanKey TotalProsperLoans
## 0 91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## 91852 91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 91852 91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 91852 91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 95009 0
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination
## 96985 0
## LoanNumber LoanOriginalAmount
## 0 0
## LoanOriginationDate LoanOriginationQuarter
## 0 0
## MemberKey MonthlyLoanPayment
## 0 0
## LP_CustomerPayments LP_CustomerPrincipalPayments
## 0 0
## LP_InterestandFees LP_ServiceFees
## 0 0
## LP_CollectionFees LP_GrossPrincipalLoss
## 0 0
## LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments
## 0 0
## PercentFunded Recommendations
## 0 0
## InvestmentFromFriendsCount InvestmentFromFriendsAmount
## 0 0
## Investors
## 0
These per-variable NA counts will help me filter out variables with too many missing values.
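A per-column NA count like the one above can be produced with `colSums(is.na(...))`. The sketch below uses a made-up data frame, and the 50% cutoff (`max_na_share`) is a hypothetical threshold chosen only for illustration.

```r
# Toy data frame with missing values (values are made up).
df_toy <- data.frame(
  BorrowerAPR        = c(0.165, NA, 0.283, 0.125),
  EstimatedReturn    = c(NA, 0.0547, NA, NA),
  LoanOriginalAmount = c(9425, 10000, 3001, 10000)
)

# Number and share of NA values per column.
na_counts <- colSums(is.na(df_toy))
na_share  <- na_counts / nrow(df_toy)

# Hypothetical rule: keep columns with at most 50% missing values.
max_na_share <- 0.5
keep    <- names(na_share)[na_share <= max_na_share]
df_kept <- df_toy[, keep, drop = FALSE]
```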
Only three loan terms (in months) occur in the data. More specifically,
##
## 12 36 60
## 1614 87778 24545
12, 36 and 60 months. Term will therefore be treated as a factor variable.
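Converting an integer column with few distinct values to a factor can be sketched like this (the vector `term` is made-up sample data):

```r
# A handful of made-up loan terms in months.
term <- c(36L, 36L, 60L, 12L, 36L)

# Convert to a factor; levels are the sorted distinct values as strings.
term_f <- factor(term)
table(term_f)

# To recover the numbers later, go through as.character();
# as.integer(term_f) alone would return the level codes 1, 2, 3.
term_back <- as.integer(as.character(term_f))
```

The round trip through `as.character()` matters whenever a numeric copy of a factor is needed, as with a separate numeric version of a score.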
## Warning: Removed 25 rows containing non-finite values (stat_bin).
## Warning: Removed 29084 rows containing non-finite values (stat_bin).
The graphs are similar to each other: each is right-skewed with an abnormal peak at the right. I will only use Borrower Rate for convenience.
Apart from the high peak around 0.36, the histogram is roughly right-skewed. According to the graph, one value near 0.36 is very common. Let’s find the exact value.
## Source: local data frame [2,294 x 2]
##
## BorrowerRate Count
## (dbl) (int)
## 1 0.3177 3672
## 2 0.3500 1905
## 3 0.3199 1651
## 4 0.2900 1508
## 5 0.2699 1319
## 6 0.1500 1182
## 7 0.1400 1035
## 8 0.1099 949
## 9 0.2000 907
## 10 0.1585 806
## .. ... ...
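The header of the table above suggests it was built with dplyr. A base R way to get the same kind of frequency ranking is sketched below on made-up rates; it is an illustrative alternative, not the original code.

```r
# Made-up borrower rates with repeated values.
rates <- c(0.3177, 0.35, 0.3177, 0.29, 0.35, 0.3177)

# Count each distinct value and sort from most to least frequent.
rate_counts <- sort(table(rates), decreasing = TRUE)
head(rate_counts, 10)

# The most frequent value, converted back from its name.
most_common <- as.numeric(names(rate_counts)[1])
```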
The table counts BorrowerRate values: 0.3177 is the most common borrower rate, with 3672 loans. Cutting BorrowerAPR into intervals and using it as a factor variable might be more useful.
##
##  (0.00602,0.0571]    (0.0571,0.108]     (0.108,0.158]     (0.158,0.209]
##                70              8501             21072             27021
##     (0.209,0.259]      (0.259,0.31]      (0.31,0.361]     (0.361,0.411]
##             21734             17373             15481              2598
##     (0.411,0.462]     (0.462,0.513]
##                58                 4
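The ten equal-width intervals above are consistent with calling `cut()` with `breaks = 10`, though the report does not show the call, so treat that as an assumption. A minimal sketch on made-up APR values:

```r
# Made-up APR values spanning roughly the same range as the real data.
apr <- c(0.00653, 0.12, 0.165, 0.246, 0.283, 0.36, 0.51229)

# Split the observed range into 10 equal-width intervals.
# cut() widens the outer breaks slightly so the minimum and maximum are included.
apr_bucket <- cut(apr, breaks = 10)

# Frequency of each interval.
table(apr_bucket)
```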
## Warning: Removed 29084 rows containing non-finite values (stat_bin).
This looks like a right-skewed graph with outliers below 0.
## Warning: Removed 29084 rows containing non-finite values (stat_bin).
The graph with more bins shows clear outliers, both negative values and values above 0.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.074 0.092 0.096 0.117 0.284 29084
##
## (-0.1,0] (0,0.2] (0.2,0.3]
## 176 84577 80
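There are 176 observations in (-0.1, 0] and 80 observations above 0.2. The three buckets sum to 84833, while 113937 - 29084 = 84853 values are non-missing, so 20 observations fall outside the breaks. Since the minimum is -0.183, these are the returns at or below -0.1.
The graph is drawn after Estimated Return is filtered to values greater than 0 and less than 0.2. Similar to the previous variables, there is a peak at the right.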
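With fixed breaks, `cut()` returns NA for any value outside the outermost breaks, and `table()` drops those by default. The sketch below uses made-up returns and assumes the breaks were -0.1, 0, 0.2 and 0.3, as the bucket labels suggest.

```r
# Made-up estimated returns, including one value below the lowest break and one NA.
est_return <- c(-0.183, -0.05, 0.0547, 0.0907, 0.25, NA)

# Assumed breaks, inferred from the bucket labels in the table above.
breaks <- c(-0.1, 0, 0.2, 0.3)
bucket <- cut(est_return, breaks = breaks)

# useNA = "ifany" makes the uncategorised values visible.
table(bucket, useNA = "ifany")

# Values that were present but fell outside every interval.
outside <- !is.na(est_return) & is.na(bucket)
sum(outside)
```

Checking `sum(outside)` is a quick way to confirm that no observation is silently lost when choosing breaks by hand.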
## Warning: Removed 29084 rows containing non-finite values (stat_bin).
##
## 1 2 3 4 5 6 7 8 9 10 11
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456
Prosper score ranges from 1 to 11.
Prosper score is converted to a factor.
## Warning: Removed 7625 rows containing non-finite values (stat_bin).
This is a clean right-skewed graph without a peak at the right.
## Warning: Removed 7625 rows containing non-finite values (stat_bin).
A square-root transformation is applied to show the right-skewed shape more clearly.
## Warning: Removed 591 rows containing non-finite values (stat_bin).
## Warning: Removed 591 rows containing non-finite values (stat_bin).
These two histograms are very similar. Instead of using both variables, I will create a new variable ‘CreditScoreRangeMid’ that is the average of the two.
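The midpoint variable described above can be computed as a simple row-wise average. A minimal sketch on a made-up data frame (`df_toy` is an assumption; the column names follow the data set):

```r
# Made-up lower and upper credit score bounds.
df_toy <- data.frame(
  CreditScoreRangeLower = c(640, 680, 480, 800),
  CreditScoreRangeUpper = c(659, 699, 499, 819)
)

# Midpoint of the reported credit score range.
df_toy$CreditScoreRangeMid <-
  (df_toy$CreditScoreRangeLower + df_toy$CreditScoreRangeUpper) / 2
```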
## Warning: Removed 591 rows containing non-finite values (stat_bin).
## Warning: Removed 7604 rows containing non-finite values (stat_bin).
## Warning: Removed 7604 rows containing non-finite values (stat_bin).
## Warning: Removed 697 rows containing non-finite values (stat_bin).
These are clean right-skewed histograms.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 15.00 22.00 23.23 30.00 126.00 7544
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
There is an outlier at about 10 (the maximum is 10.01).
The graph above shows only the Debt to Income Ratio values that are less than or equal to 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
## n
## 1 1140
There are 1140 people with a stated monthly income greater than 20526.67. Incomes are centered around the median of 4667.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2252.0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.02859 0.04962 Inf 0.07865 Inf 15
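A new variable, the ratio of monthly loan payment to stated monthly income, is created. Its mean and maximum are Inf, presumably because some borrowers report a monthly income of 0 and dividing by zero yields Inf (or NaN, counted as NA, for 0/0). The graph is not very informative because there are many outliers.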
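How a zero denominator produces `Inf` and `NaN` in such a ratio can be seen in a small sketch. The values below are made up; only the idea of dividing payment by income comes from the text.

```r
# Made-up monthly payments and stated monthly incomes, including zero incomes.
payment <- c(330, 319, 0, 123)
income  <- c(3083, 0, 0, 2083)

# Element-wise ratio: a positive payment over 0 gives Inf, 0 over 0 gives NaN.
ratio <- payment / income

# summary() reports NaN under "NA's" and Inf drags the mean to Inf.
summary(ratio)

# Restrict to finite values before summarising or plotting.
finite_ratio <- ratio[is.finite(ratio)]
summary(finite_ratio)
```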
## (-0.01,0.5] (0.5,1.26e+04] NA's
## 113355 581 1
Only 581 observations are larger than 0.5, plus one NA.
library(ggplot2)
# Histogram of the payment-to-income ratio, restricted to ratios below 0.5
ggplot(subset(df, ratio_monthly_loan_payment < 0.5),
       aes(x = ratio_monthly_loan_payment)) +
  geom_histogram(bins = 100) +
  ggtitle("Ratio_monthly_loan_payment Histogram (Filtered)")
The above histogram is plotted from the subset of the data set where ratio_monthly_loan_payment is less than 0.5. The right-skewed shape can now be seen clearly.
## 'data.frame': 113937 obs. of 87 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : Factor w/ 11 levels "1","2","3","4",..: NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## $ BorrowerAPR.bucket : Factor w/ 10 levels "(0.00602,0.0571]",..: 4 3 6 3 5 3 7 5 2 2 ...
## $ EstimatedReturn.bucket : Factor w/ 3 levels "(-0.1,0]","(0,0.2]",..: NA 2 NA 2 2 2 2 2 2 2 ...
## $ ProsperScore.numeric : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ CreditScoreRangeMid : num 650 690 490 810 690 ...
## $ ratio_monthly_loan_payment : num 0.1072 0.0521 0.0592 0.1118 0.0588 ...
## $ ratio_monthly_loan_payment.bucket : Factor w/ 2 levels "(-0.01,0.5]",..: 1 1 1 1 1 1 1 1 1 1 ...
The data set has 113937 observations and 86 variables. Many of them are factor variables; the ones I will use in the rest of the analysis are listed below.
CreditGrade, Term, LoanStatus, ProsperScore, BorrowerState, Occupation, EmploymentStatus, IncomeRange
I am interested in finding which variables affect the credit score. I will also check whether the income-related variables are correlated with LoanStatus.
Borrowers differ in interest rate, monthly income, monthly loan payment and so on. These variables might affect their ability to repay their loans, as well as their credit scores and Prosper scores. Loan amounts may also differ by employment status, occupation and credit score. For instance, I expect people with low credit scores to have higher interest rates.
I created several new variables: BorrowerAPR.bucket, EstimatedReturn.bucket, CreditScoreRangeMid and ratio_monthly_loan_payment. BorrowerAPR.bucket and EstimatedReturn.bucket split their source variables into intervals so that they can be used as factor variables. CreditScoreRangeMid is the average of CreditScoreRangeLower and CreditScoreRangeUpper. ratio_monthly_loan_payment is the ratio of the monthly loan payment to the stated monthly income.
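The derived variables described here can be sketched in base R as follows. This is a minimal illustration on invented rows, not the code used for the report: the data frame name `loans` and the explicit APR break points are assumptions, since the actual bucket boundaries are not shown in this section.

```r
# Toy data frame; all values are invented for illustration only.
loans <- data.frame(
  CreditScoreRangeLower = c(640, 680, 480, 800),
  CreditScoreRangeUpper = c(659, 699, 499, 819),
  MonthlyLoanPayment    = c(330, 319, 123, 321),
  StatedMonthlyIncome   = c(3000, 6000, 2000, 2900),
  BorrowerAPR           = c(0.165, 0.120, 0.283, 0.125)
)

# Midpoint of the reported credit score range.
loans$CreditScoreRangeMid <-
  (loans$CreditScoreRangeLower + loans$CreditScoreRangeUpper) / 2

# Share of the stated monthly income that goes to the loan payment.
loans$ratio_monthly_loan_payment <-
  loans$MonthlyLoanPayment / loans$StatedMonthlyIncome

# Bucket the APR into intervals so it can be used as a factor.
# The break points below are hypothetical.
loans$BorrowerAPR.bucket <-
  cut(loans$BorrowerAPR, breaks = c(0, 0.1, 0.2, 0.3, 0.4))

str(loans)
```

Note that a stated income of 0 would make the ratio infinite, and a very small income makes it very large, which is one way extreme values can enter a ratio like this.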
After plotting histograms of many variables, a consistent pattern appears: some of them have an unusually high peak to the right of the median. I cut such variables into distinct intervals so that I can use them in the multivariate analysis.
## [1] "Term" "LoanStatus"
## [3] "BorrowerRate" "EstimatedReturn"
## [5] "ProsperScore" "TotalCreditLinespast7years"
## [7] "StatedMonthlyIncome" "LoanOriginalAmount"
## [9] "MonthlyLoanPayment" "ProsperScore.numeric"
## [11] "CreditScoreRangeMid"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.028 0.048 8.058 0.077 12570.000
According to the matrix, BorrowerRate and EstimatedReturn are positively correlated. There is also an interesting pattern between ProsperScore and BorrowerRate, and a strong positive correlation between LoanOriginalAmount and MonthlyLoanPayment.
## Warning: Removed 591 rows containing missing values (geom_point).
As expected, Credit Score and BorrowerRate are negatively correlated.
##
## Pearson's product-moment correlation
##
## data: subset(df, CreditScoreRangeMid > 400)$BorrowerRate and subset(df, CreditScoreRangeMid > 400)$CreditScoreRangeMid
## t = -188.08, df = 113210, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4923486 -0.4834719
## sample estimates:
## cor
## -0.4879229
After removing the outliers (credit scores of 400 or below), the correlation test shows a moderate negative correlation of about -0.49.
NA values in ProsperScore can be ignored. A clear pattern can be observed in this box plot: as the score gets higher, the borrower rate gets lower.
A clear linear pattern can be observed in the graph. Interestingly, three distinct lines can be seen.
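The filtering and correlation test can be sketched like this. The threshold `CreditScoreRangeMid > 400` is taken from the test output above; the data frame `loans` and its values are invented for illustration.

```r
# Invented values; the first row mimics an implausibly low credit score.
loans <- data.frame(
  CreditScoreRangeMid = c(9.5, 620, 660, 700, 740, 780),
  BorrowerRate        = c(0.10, 0.30, 0.25, 0.18, 0.12, 0.08)
)

# Keep only plausible scores, the same filter that appears in the test output.
kept <- subset(loans, CreditScoreRangeMid > 400)

# Pearson correlation test between rate and credit score.
result <- cor.test(kept$BorrowerRate, kept$CreditScoreRangeMid)
result$estimate   # correlation coefficient
result$conf.int   # 95 percent confidence interval
```

With only five toy rows the estimate is much closer to -1 than the -0.49 measured on the real data; the sketch only shows the mechanics.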
##
## Call:
## lm(formula = MonthlyLoanPayment ~ LoanOriginalAmount, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -751.60 -23.45 -1.04 25.94 1499.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.275e+01 3.452e-01 94.9 <2e-16 ***
## LoanOriginalAmount 2.875e-02 3.313e-05 867.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.85 on 113935 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.8686
## F-statistic: 7.531e+05 on 1 and 113935 DF, p-value: < 2.2e-16
##
## Pearson's product-moment correlation
##
## data: df$MonthlyLoanPayment and df$LoanOriginalAmount
## t = 867.82, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9312165 0.9327426
## sample estimates:
## cor
## 0.9319837
As a result, these two variables have a very high R^2 (0.87) and correlation (0.93). This is reasonable because a larger loan amount means a larger monthly payment.
Because of huge outliers in the ratio_monthly_loan_payment variable, the graphs in the matrix are not informative.
To get a better graph, the outliers are removed, the data is subsetted and alpha is set to 0.1.
The graph of ratio_monthly_loan_payment against EstimatedReturn shows no obvious pattern. The ratio tends to be slightly higher around an EstimatedReturn of 0.10, but the trend is too subtle to draw conclusions from.
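For a regression with a single predictor, R^2 is the square of the Pearson correlation, so the two outputs above can be checked against each other. The first part of this sketch uses the reported numbers; the second part repeats the check on an invented toy data frame (`toy` and its values are assumptions for illustration).

```r
# Values copied from the lm and cor.test output above.
r <- 0.9319837                 # Pearson correlation
r_squared_reported <- 0.8686   # Multiple R-squared
agrees <- abs(r^2 - r_squared_reported) < 1e-4
agrees

# Slope from the lm output: extra monthly payment per 1000 borrowed.
slope <- 2.875e-02
per_thousand <- slope * 1000
per_thousand

# The same identity on invented data.
toy <- data.frame(
  amount  = c(1000, 2000, 3000, 4000, 5000),
  payment = c(40, 65, 100, 120, 160)
)
fit <- lm(payment ~ amount, data = toy)
identity_holds <- isTRUE(all.equal(
  summary(fit)$r.squared,
  cor(toy$payment, toy$amount)^2
))
identity_holds
```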
## df$Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03044 0.05301 0.30700 0.09036 192.30000
## --------------------------------------------------------
## df$Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.025 0.045 10.420 0.077 12570.000
## --------------------------------------------------------
## df$Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0374 0.0566 0.1216 0.0768 662.6000
## Source: local data frame [3 x 2]
##
## Term n
## (fctr) (int)
## 1 12 1608
## 2 36 87387
## 3 60 24521
Although the median is largest when the Term is 60, the mean is largest when the Term is 36, because the mean of that group is pulled up by extreme outliers (its maximum is 12570).
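A small sketch of why the mean and the median can disagree across groups. The values are invented, with a single extreme ratio placed in the 36-month group; `loans` and `ratio` are illustrative names.

```r
# Invented ratios; one extreme value sits in the 36-month group.
loans <- data.frame(
  Term  = factor(c(12, 12, 12, 36, 36, 36, 36, 60, 60, 60)),
  ratio = c(0.03, 0.05, 0.09, 0.02, 0.04, 0.05, 100, 0.04, 0.06, 0.08)
)

# Per-group summaries, in the same style as the output above.
by(loans$ratio, loans$Term, summary)

# The median is robust to the outlier, the mean is not.
medians <- tapply(loans$ratio, loans$Term, median)
means   <- tapply(loans$ratio, loans$Term, mean)
names(which.max(medians))
names(which.max(means))
```

This is why the medians are the more reliable figures to compare in the summaries above.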
## df$LoanStatus: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01334 0.01504 0.01559 0.02049 0.02908
## --------------------------------------------------------
## df$LoanStatus: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.027 0.046 28.450 0.079 12570.000
## --------------------------------------------------------
## df$LoanStatus: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.020 0.038 10.070 0.066 12250.000
## --------------------------------------------------------
## df$LoanStatus: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0346 0.0561 0.1050 0.0807 662.6000
## --------------------------------------------------------
## df$LoanStatus: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.026 0.044 36.730 0.079 12250.000
## --------------------------------------------------------
## df$LoanStatus: FinalPaymentInProgress
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02599 0.04673 0.05682 0.07545 0.35570
## --------------------------------------------------------
## df$LoanStatus: Past Due (>120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03811 0.06871 0.07351 0.07901 0.22600
## --------------------------------------------------------
## df$LoanStatus: Past Due (1-15 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03305 0.05222 0.06521 0.07848 2.08000
## --------------------------------------------------------
## df$LoanStatus: Past Due (16-30 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0305 0.0511 2.6280 0.0785 679.8000
## --------------------------------------------------------
## df$LoanStatus: Past Due (31-60 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0341 0.0532 5.7760 0.0802 2073.0000
## --------------------------------------------------------
## df$LoanStatus: Past Due (61-90 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03046 0.04890 0.55540 0.07273 150.90000
## --------------------------------------------------------
## df$LoanStatus: Past Due (91-120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03458 0.05538 0.06728 0.08304 1.03700
## df$LoanStatus: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2445 2600 2609 3833 4167
## --------------------------------------------------------
## df$LoanStatus: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2500 3750 4486 5500 208300
## --------------------------------------------------------
## df$LoanStatus: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2917 4417 5325 6583 618500
## --------------------------------------------------------
## df$LoanStatus: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3667 5167 6153 7447 1750000
## --------------------------------------------------------
## df$LoanStatus: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2500 3708 4367 5417 58620
## --------------------------------------------------------
## df$LoanStatus: FinalPaymentInProgress
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1167 3583 5250 6312 8333 32920
## --------------------------------------------------------
## df$LoanStatus: Past Due (>120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3115 3750 3727 4500 6667
## --------------------------------------------------------
## df$LoanStatus: Past Due (1-15 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3167 4667 5554 6948 35420
## --------------------------------------------------------
## df$LoanStatus: Past Due (16-30 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3250 4583 5484 6500 30000
## --------------------------------------------------------
## df$LoanStatus: Past Due (31-60 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2938 4583 5436 7083 25000
## --------------------------------------------------------
## df$LoanStatus: Past Due (61-90 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3167 4583 5323 6594 31250
## --------------------------------------------------------
## df$LoanStatus: Past Due (91-120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3073 4171 4816 5833 22920
Borrowers who completed their loans have higher stated monthly incomes than borrowers who defaulted (median 4417 versus 3708).
## df$LoanStatus: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1075 0.1395 0.2000 0.1844 0.2375 0.2375
## --------------------------------------------------------
## df$LoanStatus: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1769 0.2400 0.2354 0.2975 0.4500
## --------------------------------------------------------
## df$LoanStatus: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1173 0.1744 0.1864 0.2511 0.4975
## --------------------------------------------------------
## df$LoanStatus: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0577 0.1314 0.1760 0.1838 0.2310 0.3304
## --------------------------------------------------------
## df$LoanStatus: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1650 0.2296 0.2231 0.2875 0.4975
## --------------------------------------------------------
## df$LoanStatus: FinalPaymentInProgress
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0629 0.1299 0.1899 0.1970 0.2712 0.3199
## --------------------------------------------------------
## df$LoanStatus: Past Due (>120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1449 0.2079 0.2551 0.2527 0.3060 0.3199
## --------------------------------------------------------
## df$LoanStatus: Past Due (1-15 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0749 0.1870 0.2317 0.2308 0.2859 0.3435
## --------------------------------------------------------
## df$LoanStatus: Past Due (16-30 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0599 0.1899 0.2419 0.2353 0.2909 0.3304
## --------------------------------------------------------
## df$LoanStatus: Past Due (31-60 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0649 0.1855 0.2468 0.2330 0.2870 0.3304
## --------------------------------------------------------
## df$LoanStatus: Past Due (61-90 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0659 0.1914 0.2468 0.2400 0.2999 0.3304
## --------------------------------------------------------
## df$LoanStatus: Past Due (91-120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0766 0.1850 0.2495 0.2383 0.2952 0.3435
Borrowers with lower rates complete their loan payments more often than those with higher rates: the median BorrowerRate is 0.1744 for Completed loans versus 0.2296 for Defaulted ones.
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02088 0.03460 0.04419 0.05513 0.34490
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02549 0.04036 0.04934 0.06350 0.44350
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03029 0.05062 0.05840 0.07790 0.48680
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03484 0.05790 0.06302 0.08399 0.46500
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03231 0.05468 0.06244 0.08163 0.49300
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03390 0.05604 0.06336 0.08301 0.47510
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03322 0.05505 0.06098 0.08076 0.45510
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03074 0.05130 0.05757 0.07677 0.47470
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02521 0.04286 0.04827 0.06629 0.42360
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02329 0.04147 0.04665 0.06552 0.42300
## --------------------------------------------------------
## subset(df, ratio_monthly_loan_payment <= 0.5 & !is.na(ProsperScore))$ProsperScore: 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002689 0.031750 0.051300 0.053250 0.072210 0.122800
Borrowers with ProsperScores around 5 and 6 tend to have a higher ratio, and the ratio gets lower as the score moves farther away from the middle of the range.
People with a credit score around 700 tend to have the highest ratio.
##
## 19 379 439 459 479 499 519 539 559 579 599 619
## 133 1 5 36 141 346 554 1593 1474 1357 1125 3602
## 639 659 679 699 719 739 759 779 799 819 839 859
## 4172 12199 16366 16492 15471 12923 9267 6606 4624 2644 1409 567
## 879 899
## 212 27
Interestingly, the credit score increases with the Prosper score up to a Prosper score of 10 and then decreases at a Prosper score of 11. People with very good credit scores tend to have a Prosper score of 10.
This is a graph of Occupation against loan amount. There are too many occupations, so they would have to be grouped before a clearer pattern could be observed. I decided not to use Occupation for this plot.
##
## Call:
## lm(formula = LoanOriginalAmount ~ CreditScoreRangeMid, data = subset(df,
## CreditScoreRangeMid > 350))
##
## Residuals:
## Min 1Q Median 3Q Max
## -14210 -4136 -1306 3157 25449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.625e+04 1.953e+02 -83.23 <2e-16 ***
## CreditScoreRangeMid 3.537e+01 2.795e-01 126.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5850 on 113211 degrees of freedom
## Multiple R-squared: 0.1239, Adjusted R-squared: 0.1239
## F-statistic: 1.602e+04 on 1 and 113211 DF, p-value: < 2.2e-16
R^2 is small (0.12), so credit score alone explains little of the variation in loan amount, but in general people with good credit scores have larger loan amounts.
##
## Call:
## lm(formula = log(subset(df, StatedMonthlyIncome > 10)$StatedMonthlyIncome) ~
## subset(df, StatedMonthlyIncome > 10)$CreditScoreRangeMid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1391 -0.3450 0.0179 0.3694 5.8006
##
## Coefficients:
## Estimate
## (Intercept) 6.882e+00
## subset(df, StatedMonthlyIncome > 10)$CreditScoreRangeMid 2.258e-03
## Std. Error
## (Intercept) 1.914e-02
## subset(df, StatedMonthlyIncome > 10)$CreditScoreRangeMid 2.741e-05
## t value Pr(>|t|)
## (Intercept) 359.60 <2e-16
## subset(df, StatedMonthlyIncome > 10)$CreditScoreRangeMid 82.37 <2e-16
##
## (Intercept) ***
## subset(df, StatedMonthlyIncome > 10)$CreditScoreRangeMid ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6073 on 111650 degrees of freedom
## (578 observations deleted due to missingness)
## Multiple R-squared: 0.05728, Adjusted R-squared: 0.05727
## F-statistic: 6784 on 1 and 111650 DF, p-value: < 2.2e-16
##
## Pearson's product-moment correlation
##
## data: subset(df, StatedMonthlyIncome > 10)$StatedMonthlyIncome and subset(df, StatedMonthlyIncome > 10)$CreditScoreRangeMid
## t = 36.827, df = 111650, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1037514 0.1153419
## sample estimates:
## cor
## 0.1095504
##
## Pearson's product-moment correlation
##
## data: log(subset(df, StatedMonthlyIncome > 10)$StatedMonthlyIncome) and subset(df, StatedMonthlyIncome > 10)$CreditScoreRangeMid
## t = 82.366, df = 111650, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2337985 0.2448578
## sample estimates:
## cor
## 0.2393359
CreditScoreRangeMid vs. StatedMonthlyIncome was first graphed without any modification, and then with the log of monthly income. The latter gave a more linear graph and a higher correlation (0.24 versus 0.11).
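The effect of the log transform on the correlation can be sketched as follows. The filter `StatedMonthlyIncome > 10` is taken from the output above; the data frame `loans` and its values are invented, and the incomes are deliberately skewed so the effect is visible.

```r
# Invented values: income grows multiplicatively with credit score,
# and the last row has zero income.
loans <- data.frame(
  CreditScoreRangeMid = c(600, 650, 700, 750, 800, 700),
  StatedMonthlyIncome = c(1000, 2000, 4000, 8000, 16000, 0)
)

# log(0) is -Inf, so near-zero incomes are dropped before transforming.
kept <- subset(loans, StatedMonthlyIncome > 10)

cor_raw <- cor(kept$StatedMonthlyIncome, kept$CreditScoreRangeMid)
cor_log <- cor(log(kept$StatedMonthlyIncome), kept$CreditScoreRangeMid)

c(raw = cor_raw, log = cor_log)
```

On this toy data the log version is almost perfectly linear by construction; on the real data the gain is smaller, as the two cor.test outputs above show.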
##
## Call:
## lm(formula = LoanOriginalAmount ~ StatedMonthlyIncome, data = subset(df,
## StatedMonthlyIncome > 10))
##
## Residuals:
## Min 1Q Median 3Q Max
## -294921 -4585 -1890 3305 26194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.418e+03 2.294e+01 323.4 <2e-16 ***
## StatedMonthlyIncome 1.666e-01 2.435e-03 68.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6121 on 112228 degrees of freedom
## Multiple R-squared: 0.04002, Adjusted R-squared: 0.04001
## F-statistic: 4678 on 1 and 112228 DF, p-value: < 2.2e-16
##
## Pearson's product-moment correlation
##
## data: subset(df, StatedMonthlyIncome > 10)$LoanOriginalAmount and log(subset(df, StatedMonthlyIncome > 10)$StatedMonthlyIncome)
## t = 155.04, df = 112230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4151712 0.4248083
## sample estimates:
## cor
## 0.4200016
There is a roughly linear relationship between the log of monthly income and the loan amount. There are clear horizontal lines at multiples of 5000. I believe the loan amounts were rounded when the data was gathered. People whose monthly income is less than 8000 don’t usually have loan amounts greater than 25,000.
From the plots above, people with lower borrower rates and higher incomes tend to complete their loan payments more often than those with higher borrower rates and lower incomes.
I created a new variable, the ratio of monthly loan payment to monthly income. I was interested in this ratio because it can be an important factor when people repay their loans. For example, students may have lower incomes than expected if they don’t get the jobs they hoped for after graduation, in which case a large portion of their income has to go to loan payments. I thought that I could get a rough idea of unemployment rates from this analysis.
As a result, people who completed repaying their loans have a lower ratio than people who defaulted on their loans. People with the highest Prosper scores and credit scores also have a lower ratio.
I expected that a high borrower rate would come with a lower return rate, but the graph shows the opposite: the two are positively correlated.
I also expected the ratio of monthly loan payment to monthly income to be greater for those with better scores. However, the ratio is greatest when the Prosper score is 5 or 6. I then analyzed the relationship between the score and the monthly income and found that those with greater incomes have better credit scores. This explains why the ratio at 5 or 6 is greater than at higher scores: people with a good income don’t increase their loan payment amount.
The strongest relationship I found was between MonthlyLoanPayment and LoanOriginalAmount. BorrowerRate and ProsperScore also showed a strong relationship.
I plotted the ratio of monthly loan payment against credit score for each BorrowerAPR bucket. Good Prosper scores are found in the BorrowerAPR range of 0.0571 to 0.108, and these borrowers also have good credit scores. As the borrower rate increases, the Prosper score and credit score decrease.
Note that those with a ProsperScore of 11 are usually on the lines with a slope of 1/30 or 1/50.
The black dots are observations where ProsperScore is NA, so I will remove them.
Look at Completed and Defaulted. Most of the people with a ProsperScore of 11 are in Completed status.
The graph is now faceted by BorrowerAPR bucket, which makes the pattern more obvious. People with good ProsperScores are in the low-rate ranges and those with bad scores are in the high-rate ranges: as the BorrowerAPR gets larger, the score gets lower.
The previous plot already showed the positive correlation between credit score and ProsperScore. What is interesting in this graph is that those with a good ProsperScore don’t usually pay more than 20% of their income toward their loan payments.
Chargedoff, Completed and Defaulted are similar: most of these borrowers are Full-time. It is interesting to note that most of the people who are paying their loans are self-employed.
##
## (0,1e+03] (1e+03,2e+03] (2e+03,3e+03] (3e+03,4e+03]
## 1984 6362 16141 18155
## (4e+03,5e+03] (5e+03,6e+03] (6e+03,7e+03] (7e+03,8e+03]
## 19727 12859 10361 7750
## (8e+03,1.75e+06]
## 19203
StatedMonthlyIncome is cut into 9 intervals.
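A sketch of this bucketing with `cut()`. The break points are read off the interval labels in the table above (steps of 1000 up to 8000, then one open-ended bucket up to 1.75e6); the `income` vector is invented, and this is not necessarily the exact call used in the report.

```r
# Invented monthly incomes.
income <- c(500, 1500, 2500, 3500, 4500, 5500, 6500, 7500, 20000, 4000)

# Eight buckets of width 1000, plus one wide bucket for everything above 8000.
breaks <- c(seq(0, 8000, by = 1000), 1750000)
income_bucket <- cut(income, breaks = breaks)

table(income_bucket)

# Intervals are right-closed, so an income of exactly 0 falls outside
# (0, 1000] and becomes NA.
zero_bucket <- cut(0, breaks = breaks)
is.na(zero_bucket)
```

Because the lowest interval excludes 0, borrowers with zero stated income are not counted in any bucket, which is consistent with the bucket counts above summing to less than the number of observations.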
This graph also shows that people with high income levels have high Prosper scores and low borrower rates.
## Source: local data frame [68 x 3]
##
## Occupation count CreditScoreRangeMid_mean
## (fctr) (int) (dbl)
## 1 Judge 22 734.9545
## 2 Pharmacist 257 729.6556
## 3 Doctor 494 727.1113
## 4 Investor 214 724.5467
## 5 Attorney 1046 718.7161
## 6 Pilot - Private/Commercial 199 718.6457
## 7 Dentist 68 718.6176
## 8 Professor 557 717.4354
## 9 Principal 312 716.5513
## 10 Engineer - Electrical 1125 716.3800
## .. ... ... ...
## Source: local data frame [68 x 3]
##
## Occupation count CreditScoreRangeMid_mean
## (fctr) (int) (dbl)
## 1 Student - College Sophomore 69 628.3406
## 2 Student - Technical School 16 640.7500
## 3 Student - Community College 28 641.6429
## 4 Student - College Freshman 41 645.1098
## 5 Student - College Junior 112 651.6429
## 6 Homemaker 120 664.3333
## 7 Clerical 3164 669.5759
## 8 Waiter/Waitress 436 671.1055
## 9 Student - College Senior 188 672.6915
## 10 Laborer 1595 678.8542
## .. ... ... ...
There are 68 different occupations in the data set. According to the tables above, many occupations have fewer than 1000 borrowers. I will filter out the occupations whose count is less than 1000.
## Source: local data frame [27 x 3]
##
## Occupation count CreditScoreRangeMid_mean
## (fctr) (int) (dbl)
## 1 Attorney 1046 718.7161
## 2 Engineer - Electrical 1125 716.3800
## 3 Executive 4311 712.3810
## 4 Engineer - Mechanical 1406 709.8841
## 5 Computer Programmer 4478 708.2182
## 6 Nurse (RN) 2489 707.2340
## 7 Accountant/CPA 3233 705.3862
## 8 Police Officer/Correction Officer 1578 700.9702
## 9 Teacher 3759 699.7580
## 10 Professional 13628 699.5719
## .. ... ... ...
## Source: local data frame [27 x 3]
##
## Occupation count CreditScoreRangeMid_mean
## (fctr) (int) (dbl)
## 1 Clerical 3164 669.5759
## 2 Laborer 1595 678.8542
## 3 Food Service 1123 679.4733
## 4 Sales - Retail 2797 681.7703
## 5 Military Enlisted 1272 681.8585
## 6 Administrative Assistant 3688 683.1388
## 7 Retail Management 2602 689.6076
## 8 Truck Driver 1675 690.6343
## 9 Sales - Commission 3446 690.6782
## 10 Skilled Labor 2746 690.8474
## .. ... ... ...
The top 5 and bottom 5 occupations are chosen for further analysis.
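The filter-then-rank step can be sketched in base R as below. The tables above look like dplyr output, so this is an equivalent illustration rather than the original code; the data frame `loans`, its values and the small `min_count` are assumptions (the analysis uses a threshold of 1000 and takes 5 from each end).

```r
# Invented rows; "Judge" appears only once and will be filtered out.
loans <- data.frame(
  Occupation = c("Judge", "Clerical", "Clerical", "Clerical",
                 "Teacher", "Teacher", "Attorney", "Attorney"),
  CreditScoreRangeMid = c(735, 660, 670, 680, 690, 710, 720, 718)
)

min_count <- 2  # hypothetical threshold for the toy data

# Count borrowers per occupation and keep the sufficiently large groups.
counts <- table(loans$Occupation)
keep <- names(counts)[counts >= min_count]

# Mean credit score per occupation, restricted to the kept groups and ranked.
means <- tapply(loans$CreditScoreRangeMid, loans$Occupation, mean)
ranked <- sort(means[keep], decreasing = TRUE)

head(ranked, 1)  # highest mean credit score
tail(ranked, 1)  # lowest mean credit score
```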
ProsperScore vs. BorrowerRate is plotted with points colored by Occupation. Unfortunately, too many ProsperScores are NA, so this plot cannot be used for the analysis.
##
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
There are 9 different employment statuses, including missing (blank) values.
Full-time people are in the bottom right corner and Self-employed people are on the left-hand side.
From now on I will ignore the “Employed” status because it actually includes everything except the “Retired” status. From the graph, most “Full-time” people have good Prosper scores and low borrower rates, while most “Self-employed” and “Not employed” people have bad Prosper scores and high borrower rates. It is interesting to see that “Retired” and “Part-time” people tend to have good Prosper scores and low borrower rates.
StatedMonthlyIncome vs. MonthlyLoanPayment is graphed again with EmploymentStatus coloring the points. I couldn’t help but notice that some people have 0 income and are still making monthly loan payments. I will investigate these people further.
This is a jitter plot of the borrowers whose StatedMonthlyIncome is 0.
Most of these people have no Prosper score, so I will focus on “Not employed” because that group has the most non-NA data.
These are the “Not employed” people with 0 StatedMonthlyIncome. As expected, their scores are low.
It is interesting to see that many of them actually completed their loan payments despite their 0 income.
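One way to quantify this observation is to tabulate LoanStatus for the zero-income, not-employed subset. The sketch below uses an invented data frame `loans`; only the column names come from the data set.

```r
# Invented rows for illustration.
loans <- data.frame(
  StatedMonthlyIncome = c(0, 0, 0, 0, 2500, 4000),
  EmploymentStatus    = c("Not employed", "Not employed", "Not employed",
                          "Full-time", "Full-time", "Self-employed"),
  LoanStatus          = c("Completed", "Completed", "Defaulted",
                          "Current", "Completed", "Current")
)

# Borrowers who state zero income and are not employed.
zero_income <- subset(
  loans,
  StatedMonthlyIncome == 0 & EmploymentStatus == "Not employed"
)

# Share of each loan status within that subset.
status_share <- prop.table(table(zero_income$LoanStatus))
status_share
```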
Based on the observations so far, I can build a model using LoanOriginalAmount, MonthlyLoanPayment, ProsperScore, BorrowerRate, EmploymentStatus and CreditScoreRangeMid.
##
## Calls:
## m1: lm(formula = ProsperRating..numeric. ~ BorrowerRate, data = df_reduced)
## m2: lm(formula = ProsperRating..numeric. ~ BorrowerRate + LoanOriginalAmount,
## data = df_reduced)
## m3: lm(formula = ProsperRating..numeric. ~ BorrowerRate + LoanOriginalAmount +
## log(MonthlyLoanPayment), data = df_reduced)
## m4: lm(formula = ProsperRating..numeric. ~ BorrowerRate + LoanOriginalAmount +
## log(MonthlyLoanPayment) + CreditScoreRangeMid, data = df_reduced)
## m5: lm(formula = ProsperRating..numeric. ~ BorrowerRate + LoanOriginalAmount +
## log(MonthlyLoanPayment) + CreditScoreRangeMid + EmploymentStatus,
## data = df_reduced)
##
## ========================================================================================================
## m1 m2 m3 m4 m5
## --------------------------------------------------------------------------------------------------------
## (Intercept) 8.267*** 8.094*** 8.632*** 6.374*** 6.234***
## (0.005) (0.007) (0.026) (0.041) (0.041)
## BorrowerRate -21.394*** -21.014*** -20.946*** -20.073*** -20.017***
## (0.023) (0.025) (0.026) (0.028) (0.028)
## LoanOriginalAmount 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000)
## log(MonthlyLoanPayment) -0.119*** -0.114*** -0.095***
## (0.006) (0.005) (0.005)
## CreditScoreRangeMid 0.003*** 0.003***
## (0.000) (0.000)
## EmploymentStatus: Full-time/Employed 0.111***
## (0.006)
## EmploymentStatus: Not employed/Employed -0.027
## (0.019)
## EmploymentStatus: Other/Employed -0.085***
## (0.008)
## EmploymentStatus: Part-time/Employed 0.080*
## (0.031)
## EmploymentStatus: Retired/Employed 0.072**
## (0.026)
## EmploymentStatus: Self-employed/Employed -0.132***
## (0.007)
## --------------------------------------------------------------------------------------------------------
## R-squared 0.909 0.910 0.911 0.916 0.917
## adj. R-squared 0.909 0.910 0.911 0.916 0.917
## sigma 0.505 0.501 0.500 0.485 0.483
## F 841017.986 427508.345 286711.241 229104.529 92654.177
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -62027.080 -61393.349 -61163.761 -58721.850 -58294.576
## Deviance 21493.951 21173.415 21058.475 19873.908 19673.596
## AIC 124060.161 122794.699 122337.522 117455.699 116613.151
## BIC 124088.189 122832.070 122384.236 117511.756 116725.265
## N 84356 84356 84356 84356 84356
## ========================================================================================================
I found that those with good Prosper scores tend to complete their payments more often than those with bad Prosper scores, so I decided to focus on finding which variables affect the Prosper score. I found that monthly income, credit score, original loan amount and monthly loan payment are positively correlated with Prosper score, while BorrowerRate is negatively correlated with it. In other words, the ability to complete the loan payments depends on these variables.
The most interesting thing I found was that there are many people who pay their loans with 0 income, and that most of them actually completed their loan payments. Another interesting interaction was between the credit score and the ratio of monthly loan payment to monthly income: people in the mid range of credit scores have the highest ratio.
I created a model starting from ProsperRating..numeric. and BorrowerRate. ProsperRating..numeric. is used instead of ProsperScore because it is a numeric variable, while ProsperScore is a factor variable. Then I added 4 more variables: LoanOriginalAmount, log(MonthlyLoanPayment), CreditScoreRangeMid and EmploymentStatus. R-squared was 0.909 at first and rose to 0.917 as more variables were added to the model.
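The stepwise construction can be sketched with `lm()` and `update()`. The data frame `loans`, the column name `rating` and all values are invented, and only two of the five models are shown, so this illustrates the procedure rather than reproducing the table above.

```r
# Invented data: rating falls as the rate rises, with some noise.
loans <- data.frame(
  rating              = c(7, 6, 6, 5, 4, 3, 2, 1),
  BorrowerRate        = c(0.07, 0.10, 0.12, 0.15, 0.20, 0.24, 0.29, 0.33),
  CreditScoreRangeMid = c(780, 740, 700, 720, 680, 690, 640, 650)
)

# Start with one predictor, then add another to the same formula.
m1 <- lm(rating ~ BorrowerRate, data = loans)
m2 <- update(m1, . ~ . + CreditScoreRangeMid)

# R-squared of each model.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
r2

# Information criteria penalize the extra parameters.
AIC(m1, m2)
```

R-squared can never decrease when a predictor is added to a nested linear model, so the rise from 0.909 to 0.917 on its own does not show that the extra variables help. Adjusted R-squared, AIC and BIC, which are also listed in the table above, are the fairer comparison.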
## Warning: Removed 29084 rows containing non-finite values (stat_bin).
This is a histogram of ProsperScore variable. It looks similar to the normal distribution graph. Most of the people have ProsperScore between 4 and 8 and a small number of poeple are outside this range.
These 3 graphs show that ProsperScore, BorrowerRate and CreditScore are somewhat correlated. There are negative correlations between ProsperScore and BorrowerRate and between CreditScore and BorrowerRate. And there is a positive correlation between CreditScore and ProsperScore.
Through this graph, the linear model can be constructed using ProsperScore, BorrowerRate and Monthly Income. This graph shows the that people with good ProsperScore and low BorrowerRate have large monthly income and people with bad ProsperScore and high BorrowerRate have low monthly income.
There are total 113937 records with 81 variables in the data. I first filtered out the variables with too many NA values. Then, I analyzed each variable by creating a histogram and try to find variables that are important to this data set. There are some variables that got my attention: BorrowerRate, BorrowerAPR, CreditScore, ProsperScore, etc. I put the most emphasis on ProsperScore because according to the description of the data set, ProsperScore indicates the risk of repaying loans where 10 means the lowest risk and 0 means the highest risk. Since there are total 81 variables in the data set there are so many interesting variables I wanted to investigate further such as listing category that indicates the types of loan and BorrowerState that indicates the state of the address of the borrower. However, I decided to focus on finding the variables that affect the ProsperScore.
During the investigation, I added 6 more variables to the data: BorrowerAPR.bucket, EstimatedReturn.bucket, ProsperScore.numeric, CreditScoreRangeMid, ratio monthly loan payment and ratio monthly loan payment.bucket. I used BorrowerAPR and BorrowerRate interchangeably because they are almost the same.
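A few of these derived variables could be built roughly as follows. The definitions are assumptions inferred from the variable names: CreditScoreRangeMid as the midpoint of the lower and upper credit score bounds, the payment ratio as MonthlyLoanPayment divided by StatedMonthlyIncome, and the bucket breaks are arbitrary. The column name `ratio.monthly.loan.payment` and the data frame `demo` are hypothetical.

```r
# Three made-up loans with the columns needed for the derived variables.
demo <- data.frame(
  CreditScoreRangeLower = c(640, 700, 760),
  CreditScoreRangeUpper = c(659, 719, 779),
  BorrowerAPR           = c(0.08, 0.21, 0.33),
  MonthlyLoanPayment    = c(150, 300, 120),
  StatedMonthlyIncome   = c(3000, 2500, 6000)
)

# Assumed definition: midpoint of the reported credit score range.
demo$CreditScoreRangeMid <-
  (demo$CreditScoreRangeLower + demo$CreditScoreRangeUpper) / 2

# Bucket a continuous variable with cut(); the breaks are illustrative only.
demo$BorrowerAPR.bucket <- cut(demo$BorrowerAPR,
                               breaks = c(0, 0.1, 0.2, 0.3, 0.4))

# Assumed definition: share of monthly income spent on the loan payment.
demo$ratio.monthly.loan.payment <-
  demo$MonthlyLoanPayment / demo$StatedMonthlyIncome

print(demo[, c("CreditScoreRangeMid", "BorrowerAPR.bucket",
               "ratio.monthly.loan.payment")])
```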
## Warning: Removed 25 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: df$BorrowerRate and df$BorrowerAPR
## t = 2347.7, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9897057 0.9899409
## sample estimates:
## cor
## 0.989824
As the above plot and calculation indicate, the correlation is very close to 1. I bucketed BorrowerAPR so that it could be used in the multivariate analysis.
While the univariate analysis showed me which variables would be interesting to use, the bivariate analysis showed the relationships between the pairs of variables I chose. One interesting thing I found during the analysis was that people who have a good ProsperScore tend to have a low BorrowerRate and a high credit score. People whose monthly loan payment is a higher ratio of their monthly income usually have mid-range credit scores and Prosper scores. Those with high credit scores and Prosper scores do not have high ratios. Later investigation showed that, although people with high ratios and people with low ratios make similar loan payments, those with a high Prosper score usually have a higher monthly income. So people with a high income and a high Prosper score do not necessarily make larger loan payments.
There are a few problems I confronted during the analysis. When using the monthly income variable, there were a lot of outliers far beyond the mean and median of the data. At first I filtered those outliers out, but I realized that I would have to discard too much of the data. So I used a log10 transformation instead, which produced a better histogram and plot of the data.
As I mentioned, there are still a lot of interesting variables that could provide interesting results from further analysis. It would be interesting to find which state has the greatest loan amount borrowed and the highest percentage of completed loans. I would also like to find which type of loan has the highest percentage of completion and the highest mean and median Prosper score and credit score.
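A check of this kind can be reproduced with `cor.test()`. The sketch below uses simulated vectors `rate` and `apr` (assumed names and made-up values, with the APR set to the rate plus a small fee-like offset and noise), not the actual loan data.

```r
# Simulated stand-ins for BorrowerRate and BorrowerAPR.
set.seed(42)
rate <- runif(1000, 0.05, 0.35)
apr  <- rate + 0.02 + rnorm(1000, sd = 0.005)

# Pearson correlation test; returns the estimate and a confidence interval.
res <- cor.test(rate, apr)

print(res$estimate)   # sample correlation
print(res$conf.int)   # 95 percent confidence interval
```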
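The effect of the log10 transformation on a heavily right-skewed variable can be illustrated like this. The incomes are simulated, and the helper `skew()` is a simple moment-based skewness written for this sketch; neither comes from the original analysis.

```r
# Simulated right-skewed monthly incomes plus two extreme outliers.
set.seed(7)
income <- rlnorm(5000, meanlog = log(4500), sdlog = 0.7)
income <- c(income, 250000, 1750000)

# log10(0) is -Inf, so keep only strictly positive incomes.
positive   <- income[income > 0]
log_income <- log10(positive)

# Simple moment-based skewness; values near 0 mean a more symmetric shape.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

print(c(raw = skew(positive), log10 = skew(log_income)))
```

Unlike filtering, the transformation keeps every positive observation and only compresses the scale, so the outliers no longer dominate the histogram.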
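As a starting point for the per-state questions, a grouped summary with `tapply()` might look like the sketch below. The rows are invented, and treating a LoanStatus of "Completed" as a finished loan is an assumption about how the status would be coded.

```r
# Invented loans for three states.
demo <- data.frame(
  BorrowerState      = c("CA", "CA", "CA", "NY", "NY", "TX"),
  LoanStatus         = c("Completed", "Current", "Completed",
                         "Chargedoff", "Completed", "Current"),
  LoanOriginalAmount = c(5000, 10000, 4000, 15000, 3000, 20000)
)

# Total amount borrowed per state.
total_amount <- tapply(demo$LoanOriginalAmount, demo$BorrowerState, sum)

# Share of loans per state whose status is "Completed" (assumed coding).
completed_share <- tapply(demo$LoanStatus == "Completed",
                          demo$BorrowerState, mean)

print(total_amount)
print(round(completed_share, 2))

# State with the largest total and state with the highest completed share.
top_amount_state    <- names(which.max(total_amount))
top_completed_state <- names(which.max(completed_share))
```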